{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "bFWbEb6uGbN-"
   },
   "source": [
    "# Week 4: Predicting the next word\n",
    "\n",
    "Welcome to this assignment! During this week you saw how to create a model that will predict the next word in a text sequence, now you will implement such model and train it using a corpus of [Shakespeare Sonnets](https://www.opensourceshakespeare.org/views/sonnets/sonnet_view.php?range=viewrange&sonnetrange1=1&sonnetrange2=154), while also creating some helper functions to pre-process the data."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "#### TIPS FOR SUCCESSFUL GRADING OF YOUR ASSIGNMENT:\n",
    "\n",
    "- All cells are frozen except for the ones where you need to submit your solutions or when explicitly mentioned you can interact with it.\n",
    "\n",
    "\n",
    "- You can add new cells to experiment but these will be omitted by the grader, so don't rely on newly created cells to host your solution code, use the provided places for this.\n",
    "- You can add the comment # grade-up-to-here in any graded cell to signal the grader that it must only evaluate up to that point. This is helpful if you want to check if you are on the right track even if you are not done with the whole assignment. Be sure to remember to delete the comment afterwards!\n",
    "- Avoid using global variables unless you absolutely have to. The grader tests your code in an isolated environment without running all cells from the top. As a result, global variables may be unavailable when scoring your submission. Global variables that are meant to be used will be defined in UPPERCASE.\n",
    "\n",
    "- To submit your notebook, save it and then click on the blue submit button at the beginning of the page.\n",
    "\n",
    "Let's get started!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "BOwsuGQQY9OL",
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "import numpy as np \n",
    "import matplotlib.pyplot as plt\n",
    "import tensorflow as tf\n",
    "import pickle"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "import unittests"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "## Defining some useful global variables\n",
    "\n",
    "Next you will define some global variables that will be used throughout the assignment. Feel free to reference them in the upcoming exercises:\n",
    "\n",
    "- `FILE_PATH`: The file path where the sonnets file is located. \n",
    "\n",
    "- `NUM_BATCHES`: Number of batches. Defaults to 16.\n",
    "- `LSTM_UNITS`: Number of LSTM units in the LSTM layer.\n",
    "- `EMBEDDING_DIM`: Number of dimensions in the embedding layer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "FILE_PATH = './data/sonnets.txt'\n",
    "NUM_BATCHES = 16\n",
    "LSTM_UNITS = 128\n",
    "EMBEDDING_DIM = 100"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**A note about grading:**\n",
    "\n",
    "**When you submit this assignment for grading these same values for these globals will be used so make sure that all your code works well with these values. After submitting and passing this assignment, you are encouraged to come back here and play with these parameters to see the impact they have in the classification process. Since this next cell is frozen, you will need to copy the contents into a new cell and run it to overwrite the values for these globals.**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "### Reading the dataset\n",
    "\n",
    "For this assignment you will be using the [Shakespeare Sonnets Dataset](https://www.opensourceshakespeare.org/views/sonnets/sonnet_view.php?range=viewrange&sonnetrange1=1&sonnetrange2=154), which contains more than 2000 lines of text extracted from Shakespeare's sonnets."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "Pfd-nYKij5yY"
   },
   "outputs": [],
   "source": [
    "# Read the data\n",
    "with open(FILE_PATH) as f:\n",
    "    data = f.read()\n",
    "\n",
    "# Convert to lower case and save as a list\n",
    "corpus = data.lower().split(\"\\n\")\n",
    "\n",
    "print(f\"There are {len(corpus)} lines of sonnets\\n\")\n",
    "print(f\"The first 5 lines look like this:\\n\")\n",
    "for i in range(5):\n",
    "  print(corpus[i])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "imB15zrSNhA1"
   },
   "source": [
    "## Exercise 1: fit_vectorizer\n",
    "\n",
    "In this exercise, you will use the [tf.keras.layers.TextVectorization layer](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) to tokenize and transform the text into numeric values. \n",
    "\n",
    "Note that in this case you will not pad the sentences right now as you've done before, because you need to build the n-grams before padding, so pay attention with the appropriate arguments passed to the TextVectorization layer!\n",
    "\n",
    "**Note**:\n",
    "- You should remove the punctuation and use only lowercase words, so you must pass the correct argument to TextVectorization layer.\n",
    "\n",
    "- In this case you will not pad the sentences with the TextVectorization layer as you've done before, because you need to build the n-grams before padding. Remember that by default, the TextVectorization layer will return a Tensor and therefore every element in it must have the same size, so if you pass two sentences of different length to be parsed, they will be padded. If you do not want to do that, you need to either pass the parameter ragged=True, or pass only a single sentence at the time. Later on in the assignment you will build the n-grams and depending on how you will iterate over the sentences, this may be important. If you choose to first pass the entire corpus to the TextVectorization and then perform the iteration, then you should pass ragged=True, otherwise, if you use the TextVectorization on each sentence separately, then you should not worry about it.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "def fit_vectorizer(corpus):\n",
    "    \"\"\"\n",
    "    Instantiates the vectorizer class on the corpus\n",
    "    \n",
    "    Args:\n",
    "        corpus (list): List with the sentences.\n",
    "    \n",
    "    Returns:\n",
    "        (tf.keras.layers.TextVectorization): an instance of the TextVectorization class containing the word-index dictionary, adapted to the corpus sentences.\n",
    "    \"\"\"    \n",
    "\n",
    "    tf.keras.utils.set_random_seed(65) # Do not change this line or you may have different expected outputs throughout the assignment\n",
    "\n",
    "    ### START CODE HERE ###\n",
    "\n",
    "    # Define the object\n",
    "    vectorizer = None\n",
    "    \n",
    "    # Adapt it to the corpus\n",
    "\n",
    "\n",
    "    ### END CODE HERE ###\n",
    "    \n",
    "    return vectorizer"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "vectorizer = fit_vectorizer(corpus)\n",
    "total_words = len(vectorizer.get_vocabulary())\n",
    "print(f\"Total number of words in corpus (including the out of vocabulary): {total_words}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "**Expected output:**\n",
    "\n",
    "```\n",
    "Total number of words in corpus (including the out of vocabulary): 3189\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "77-0sA46OETa"
   },
   "source": [
    "One thing to note is that you can either pass a string or a list of strings to vectorizer. If you pass the former, it will return a *tensor* whereas if you pass the latter, it will return a *ragged tensor* if you've correctly configured the TextVectorization layer to do so."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "tqhPxdeXlfjh"
   },
   "outputs": [],
   "source": [
    "print(f\"Passing a string directly: {vectorizer('This is a test string').__repr__()}\")\n",
    "print(f\"Passing a list of strings: {vectorizer(['This is a test string'])}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "**Expected output:**\n",
    "\n",
    "```\n",
    "Passing a string directly: <tf.Tensor: shape=(5,), dtype=int64, numpy=array([  29,   14,   18,    1, 1679])>\n",
    "Passing a list of strings: <tf.RaggedTensor [[29, 14, 18, 1, 1679]]>\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_fit_vectorizer(fit_vectorizer)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "-oqy9KjXRJ9A"
   },
   "source": [
    "## Generating n-grams\n",
    "\n",
    "As you saw in the lecture, the idea now is to generate the n-grams for each sentence in the corpus. So, for instance, if a vectorized sentence is given by `[45, 75, 195, 879]`, you must generate the following vectors:\n",
    "\n",
    "```Python\n",
    "[45, 75]\n",
    "[45, 75, 195]\n",
    "[45, 75, 195, 879]\n",
    "```\n",
    "## Exercise 2: n_grams_seqs\n",
    "\n",
    "Now complete the `n_gram_seqs` function below. This function receives the fitted vectorizer and the corpus (which is a list of strings) and should return a list containing the `n_gram` sequences for each line in the corpus.\n",
    "\n",
    "**NOTE:**\n",
    "\n",
    "- If you pass `vectorizer(sentence)` the result is not padded, whereas if you pass `vectorizer(list_of_sentences)`, the result won't be padded **only if you passed the argument `ragged = True`** in the TextVectorization setup.\n",
    "- This exercise directly depends on the previous one, because you need to pass the defined vectorizer as a parameter, so any error thrown in the previous exercise may propagate here."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "id": "iy4baJMDl6kj",
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTION: n_gram_seqs\n",
    "\n",
    "def n_gram_seqs(corpus, vectorizer):\n",
    "    \"\"\"\n",
    "    Generates a list of n-gram sequences\n",
    "    \n",
    "    Args:\n",
    "        corpus (list of string): lines of texts to generate n-grams for\n",
    "        vectorizer (tf.keras.layers.TextVectorization): an instance of the TextVectorization class adapted in the corpus\n",
    "    \n",
    "    Returns:\n",
    "        (list of tf.int64 tensors): the n-gram sequences for each line in the corpus\n",
    "    \"\"\"\n",
    "    input_sequences = []\n",
    "\n",
    "    ### START CODE HERE ###\n",
    "    \n",
    "    \n",
    "    ### END CODE HERE ###\n",
    "    \n",
    "    return input_sequences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "DlKqW2pfM7G3"
   },
   "outputs": [],
   "source": [
    "# Test your function with one example\n",
    "first_example_sequence = n_gram_seqs([corpus[0]], vectorizer)\n",
    "\n",
    "print(\"n_gram sequences for first example look like this:\\n\")\n",
    "first_example_sequence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "0HL8Ug6UU0Jt"
   },
   "source": [
    "**Expected Output:**\n",
    "\n",
    "```\n",
    "n_gram sequences for first example look like this:\n",
    "\n",
    "[<tf.Tensor: shape=(2,), dtype=int64, numpy=array([ 35, 489])>,\n",
    " <tf.Tensor: shape=(3,), dtype=int64, numpy=array([  35,  489, 1259])>,\n",
    " <tf.Tensor: shape=(4,), dtype=int64, numpy=array([  35,  489, 1259,  164])>,\n",
    " <tf.Tensor: shape=(5,), dtype=int64, numpy=array([  35,  489, 1259,  164,  230])>,\n",
    " <tf.Tensor: shape=(6,), dtype=int64, numpy=array([  35,  489, 1259,  164,  230,  582])>]\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "wtPpCcBjNc4c"
   },
   "outputs": [],
   "source": [
    "# Test your function with a bigger corpus\n",
    "next_3_examples_sequence = n_gram_seqs(corpus[1:4], vectorizer)\n",
    "\n",
    "print(\"n_gram sequences for next 3 examples look like this:\\n\")\n",
    "next_3_examples_sequence"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "EIzecMczU9UB"
   },
   "source": [
    "**Expected Output:**\n",
    "\n",
    "```\n",
    "n_gram sequences for next 3 examples look like this:\n",
    "\n",
    "[<tf.Tensor: shape=(2,), dtype=int64, numpy=array([  9, 935])>,\n",
    " <tf.Tensor: shape=(3,), dtype=int64, numpy=array([  9, 935, 143])>,\n",
    " <tf.Tensor: shape=(4,), dtype=int64, numpy=array([  9, 935, 143, 369])>,\n",
    " <tf.Tensor: shape=(5,), dtype=int64, numpy=array([  9, 935, 143, 369, 101])>,\n",
    " <tf.Tensor: shape=(6,), dtype=int64, numpy=array([  9, 935, 143, 369, 101, 171])>,\n",
    " <tf.Tensor: shape=(7,), dtype=int64, numpy=array([  9, 935, 143, 369, 101, 171, 207])>,\n",
    " <tf.Tensor: shape=(2,), dtype=int64, numpy=array([17, 23])>,\n",
    " <tf.Tensor: shape=(3,), dtype=int64, numpy=array([17, 23,  3])>,\n",
    " <tf.Tensor: shape=(4,), dtype=int64, numpy=array([  17,   23,    3, 1006])>,\n",
    " <tf.Tensor: shape=(5,), dtype=int64, numpy=array([  17,   23,    3, 1006,   64])>,\n",
    " <tf.Tensor: shape=(6,), dtype=int64, numpy=array([  17,   23,    3, 1006,   64,   31])>,\n",
    " <tf.Tensor: shape=(7,), dtype=int64, numpy=array([  17,   23,    3, 1006,   64,   31,   51])>,\n",
    " <tf.Tensor: shape=(8,), dtype=int64, numpy=array([  17,   23,    3, 1006,   64,   31,   51,  803])>,\n",
    " <tf.Tensor: shape=(2,), dtype=int64, numpy=array([ 27, 315])>,\n",
    " <tf.Tensor: shape=(3,), dtype=int64, numpy=array([ 27, 315, 745])>,\n",
    " <tf.Tensor: shape=(4,), dtype=int64, numpy=array([ 27, 315, 745, 101])>,\n",
    " <tf.Tensor: shape=(5,), dtype=int64, numpy=array([ 27, 315, 745, 101, 209])>,\n",
    " <tf.Tensor: shape=(6,), dtype=int64, numpy=array([ 27, 315, 745, 101, 209,  27])>,\n",
    " <tf.Tensor: shape=(7,), dtype=int64, numpy=array([ 27, 315, 745, 101, 209,  27, 286])>]\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_n_gram_seqs(n_gram_seqs)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "dx3V_RjFWQSu"
   },
   "source": [
    "Apply the `n_gram_seqs` transformation to the whole corpus and save the maximum sequence length to use it later:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "laMwiRUpmuSd"
   },
   "outputs": [],
   "source": [
    "# Apply the n_gram_seqs transformation to the whole corpus\n",
    "input_sequences = n_gram_seqs(corpus, vectorizer)\n",
    "\n",
    "# Save max length \n",
    "max_sequence_len = max([len(x) for x in input_sequences])\n",
    "\n",
    "print(f\"n_grams of input_sequences have length: {len(input_sequences)}\")\n",
    "print(f\"maximum length of sequences is: {max_sequence_len}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "2OciMdmEdE9L"
   },
   "source": [
    "**Expected Output:**\n",
    "\n",
    "```\n",
    "n_grams of input_sequences have length: 15355\n",
    "maximum length of sequences is: 11\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "zHY7HroqWq12"
   },
   "source": [
    "## Exercise 3: pad_seqs\n",
    "\n",
    "Now code the `pad_seqs` function which will pad any given sequences to the desired maximum length. Notice that this function receives a list of sequences and should return a numpy array with the padded sequences. You may have a look at the documentation of [`tf.keras.utils.pad_sequences`](https://www.tensorflow.org/api_docs/python/tf/keras/utils/pad_sequences). \n",
    "\n",
    "**NOTE**: \n",
    "\n",
    "- Remember to pass the correct padding method as discussed in the lecture."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "code",
    "deletable": false,
    "id": "WW1-qAZaWOhC",
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTION: pad_seqs\n",
    "\n",
    "def pad_seqs(input_sequences, max_sequence_len):\n",
    "    \"\"\"\n",
    "    Pads tokenized sequences to the same length\n",
    "    \n",
    "    Args:\n",
    "        input_sequences (list of int): tokenized sequences to pad\n",
    "        maxlen (int): maximum length of the token sequences\n",
    "    \n",
    "    Returns:\n",
    "        (np.array of int32): tokenized sequences padded to the same length\n",
    "    \"\"\"\n",
    "    \n",
    "    ### START CODE HERE ###\n",
    "\n",
    "    padded_sequences = None\n",
    "\n",
    "    ### END CODE HERE ###\n",
    "    \n",
    "    return padded_sequences"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "IqVQ0pb3YHLr"
   },
   "outputs": [],
   "source": [
    "# Test your function with the n_grams_seq of the first example\n",
    "first_padded_seq = pad_seqs(first_example_sequence, max([len(x) for x in first_example_sequence]))\n",
    "first_padded_seq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "Re_avDznXRnU"
   },
   "source": [
    "**Expected Output:**\n",
    "\n",
    "```\n",
    "array([[   0,    0,    0,    0,   35,  489],\n",
    "       [   0,    0,    0,   35,  489, 1259],\n",
    "       [   0,    0,   35,  489, 1259,  164],\n",
    "       [   0,   35,  489, 1259,  164,  230],\n",
    "       [  35,  489, 1259,  164,  230,  582]], dtype=int32)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "j56_UCOBYzZt"
   },
   "outputs": [],
   "source": [
    "# Test your function with the n_grams_seq of the next 3 examples\n",
    "next_3_padded_seq = pad_seqs(next_3_examples_sequence, max([len(s) for s in next_3_examples_sequence]))\n",
    "next_3_padded_seq"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "3rmcDluOXcIU"
   },
   "source": [
    "**Expected Output:**\n",
    "\n",
    "```\n",
    "array([[   0,    0,    0,    0,    0,    0,    9,  935],\n",
    "       [   0,    0,    0,    0,    0,    9,  935,  143],\n",
    "       [   0,    0,    0,    0,    9,  935,  143,  369],\n",
    "       [   0,    0,    0,    9,  935,  143,  369,  101],\n",
    "       [   0,    0,    9,  935,  143,  369,  101,  171],\n",
    "       [   0,    9,  935,  143,  369,  101,  171,  207],\n",
    "       [   0,    0,    0,    0,    0,    0,   17,   23],\n",
    "       [   0,    0,    0,    0,    0,   17,   23,    3],\n",
    "       [   0,    0,    0,    0,   17,   23,    3, 1006],\n",
    "       [   0,    0,    0,   17,   23,    3, 1006,   64],\n",
    "       [   0,    0,   17,   23,    3, 1006,   64,   31],\n",
    "       [   0,   17,   23,    3, 1006,   64,   31,   51],\n",
    "       [  17,   23,    3, 1006,   64,   31,   51,  803],\n",
    "       [   0,    0,    0,    0,    0,    0,   27,  315],\n",
    "       [   0,    0,    0,    0,    0,   27,  315,  745],\n",
    "       [   0,    0,    0,    0,   27,  315,  745,  101],\n",
    "       [   0,    0,    0,   27,  315,  745,  101,  209],\n",
    "       [   0,    0,   27,  315,  745,  101,  209,   27],\n",
    "       [   0,   27,  315,  745,  101,  209,   27,  286]], dtype=int32)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_pad_seqs(pad_seqs)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "rgK-Q_micEYA"
   },
   "outputs": [],
   "source": [
    "# Pad the whole corpus\n",
    "input_sequences = pad_seqs(input_sequences, max_sequence_len)\n",
    "\n",
    "print(f\"padded corpus has shape: {input_sequences.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "id": "59RD1YYNc7CW"
   },
   "source": [
    "**Expected Output:**\n",
    "\n",
    "```\n",
    "padded corpus has shape: (15355, 11)\n",
    "\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "ZbOidyPrXxf7"
   },
   "source": [
    "## Exercise 4: features_and_labels_dataset\n",
    "\n",
    "Before feeding the data into the neural network you should split it into features and labels. In this case the features will be the *padded n_gram sequences* with the **last element** removed from them and the labels will be the removed words.\n",
    "\n",
    "Complete the `features_and_labels_dataset` function below. This function expects the `padded n_gram sequences` as input and should return a **batched** [tensorflow dataset](https://www.tensorflow.org/api_docs/python/tf/data/Dataset) containing elements in the form (sentence, label). \n",
    "\n",
    "\n",
    "**NOTE**:\n",
    "- Notice that the function also receives the total of words in the corpus, this parameter will be **very important when one hot encoding the labels** since every word in the corpus will be a label at least once. The function you should use is [`tf.keras.utils.to_categorical`]((https://www.tensorflow.org/api_docs/python/tf/keras/utils/to_categorical)).\n",
    "- To generate a dataset you may use the function [tf.data.Dataset.from_tensor_slices](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#from_tensor_slices) after obtaining the sentences and their respective labels.\n",
    "- To batch a dataset, you may call the method [.batch](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch). A good number is `16`, but feel free to choose any number you want to, but keep it not greater than 64, otherwise the model may take too many epochs to achieve a good accuracy. Remember this is defined as a global variable."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "code",
    "deletable": false,
    "id": "9WGGbYdnZdmJ",
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTION: features_and_labels\n",
    "\n",
    "def features_and_labels_dataset(input_sequences, total_words):\n",
    "    \"\"\"\n",
    "    Generates features and labels from n-grams and returns a tensorflow dataset\n",
    "    \n",
    "    Args:\n",
    "        input_sequences (list of int): sequences to split features and labels from\n",
    "        total_words (int): vocabulary size\n",
    "    \n",
    "    Returns:\n",
    "        (tf.data.Dataset): Dataset with elements in the form (sentence, label)\n",
    "    \"\"\"\n",
    "    ### START CODE HERE ###\n",
    "\n",
    "    # Define the features an labels as discussed in the lectures\n",
    "    features = None\n",
    "    labels = None\n",
    "\n",
    "    # One hot encode the labels\n",
    "    one_hot_labels = None\n",
    "\n",
    "    # Build the dataset with the features and one hot encoded labels\n",
    "    dataset = None \n",
    "\n",
    "    # Batch de dataset with number of batches given by the global variable\n",
    "    batched_dataset = None\n",
    "\n",
    "    ### END CODE HERE ##\n",
    "\n",
    "    return batched_dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "23DolaBRaIAZ"
   },
   "outputs": [],
   "source": [
    "# Test your function with the padded n_grams_seq of the first example\n",
    "dataset_example = features_and_labels_dataset(first_padded_seq, total_words)\n",
    "\n",
    "print(\"Example:\\n\")\n",
    "for features, label in dataset_example.take(1):\n",
    "    print(f\"N grams:\\n\\n {features}\\n\")\n",
    "    print(f\"Label shape:\\n\\n {label.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "7t4yAx2UaQ43"
   },
   "source": [
    "**Expected Output:**\n",
    "\n",
    "```\n",
    "Example:\n",
    "\n",
    "N grams:\n",
    "\n",
    " [[   0    0    0    0   35]\n",
    " [   0    0    0   35  489]\n",
    " [   0    0   35  489 1259]\n",
    " [   0   35  489 1259  164]\n",
    " [  35  489 1259  164  230]]\n",
    "\n",
    "Label shape:\n",
    "\n",
    " (5, 3189)\n",
    "```"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_features_and_labels_dataset(features_and_labels_dataset)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "Now let's generate the whole dataset that will be used for training. In this case, let's use the [.prefetch](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#prefetch) method to speed up the training. Since the dataset is not that big, you should not have problems with memory by doing this. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "GRTuLEt3bRKa"
   },
   "outputs": [],
   "source": [
    "# Split the whole corpus\n",
    "dataset = features_and_labels_dataset(input_sequences, total_words).prefetch(tf.data.AUTOTUNE)\n",
    "\n",
    "print(f\"Feature shape: {dataset.element_spec[0]}\")\n",
    "print(f\"Label shape: {dataset.element_spec[1]}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "xXSMK_HpdLns"
   },
   "source": [
    "**Expected Output:**\n",
    "\n",
    "```\n",
    "Feature shape: TensorSpec(shape=(None, 10), dtype=tf.int32, name=None)\n",
    "Label shape: TensorSpec(shape=(None, 3189), dtype=tf.float32, name=None)\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "ltxaOCE_aU6J"
   },
   "source": [
    "## Exercise 5: create_model\n",
    "\n",
    "Now you should define a model architecture capable of achieving an accuracy of at least 80%.\n",
    "\n",
    "Some hints to help you in this task:\n",
    "\n",
    "- The first layer in your model must be an [Input](https://www.tensorflow.org/api_docs/python/tf/keras/Input) layer with the appropriate parameters, remember that your input are vectors with a fixed length size. Be careful with the size value you should pass as you've removed the last element of every input to be the label.\n",
    "\n",
    "- An appropriate `output_dim` for the first layer (Embedding) is 100, this is already provided for you.\n",
    "- A Bidirectional LSTM is helpful for this particular problem.\n",
    "- The last layer should have the same number of units as the total number of words in the corpus and a softmax activation function.\n",
    "- This problem can be solved with only two layers (excluding the Embedding and Input) so try out small architectures first.\n",
    "- 30 epochs should be enough to get an accuracy higher than 80%, if this is not the case try changing the architecture of your model."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "cellView": "code",
    "deletable": false,
    "id": "XrE6kpJFfvRY",
    "tags": [
     "graded"
    ]
   },
   "outputs": [],
   "source": [
    "# GRADED FUNCTION: create_model\n",
    "\n",
    "def create_model(total_words, max_sequence_len):\n",
    "    \"\"\"\n",
    "    Creates a text generator model\n",
    "    \n",
    "    Args:\n",
    "        total_words (int): size of the vocabulary for the Embedding layer input\n",
    "        max_sequence_len (int): length of the input sequences\n",
    "    \n",
    "    Returns:\n",
    "       (tf.keras Model): the text generator model\n",
    "    \"\"\"\n",
    "    model = tf.keras.Sequential()\n",
    "\n",
    "    ### START CODE HERE ###\n",
    "    model.add(tf.keras.layers.Input(None))\n",
    "    model.add(tf.keras.layers.Embedding(None, None))\n",
    "\n",
    "\n",
    "    # Compile the model\n",
    "    model.compile(loss=None,\n",
    "                  optimizer=None,\n",
    "                  metrics = None)\n",
    "    \n",
    "    ### END CODE HERE ###\n",
    "\n",
    "    return model"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next cell allows you to check the number of total and trainable parameters of your model and prompts a warning in case these exceeds those of a reference solution, this serves the following 3 purposes listed in order of priority:\n",
    "\n",
    "- Helps you prevent crashing the kernel during training.\n",
    "\n",
    "- Helps you avoid longer-than-necessary training times.\n",
    "- Provides a reasonable estimate of the size of your model. In general you will usually prefer smaller models given that they accomplish their goal successfully.\n",
    "\n",
    "**Notice that this is just informative** and may be very well below the actual limit for size of the model necessary to crash the kernel. So even if you exceed this reference you are probably fine. However, **if the kernel crashes during training or it is taking a very long time and your model is larger than the reference, come back here and try to get the number of parameters closer to the reference.**"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "0IpX_Gu_gISk",
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Get the untrained model\n",
    "model = create_model(total_words, max_sequence_len)\n",
    "\n",
    "# Check the parameter count against a reference solution\n",
    "unittests.parameter_count(model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "example_batch = dataset.take(1)\n",
    "\n",
    "try:\n",
    "\tmodel.evaluate(example_batch, verbose=False)\n",
    "except:\n",
    "\tprint(\"Your model is not compatible with the dataset you defined earlier. Check that the loss function and last layer are compatible with one another.\")\n",
    "else:\n",
    "\tpredictions = model.predict(example_batch, verbose=False)\n",
    "\tprint(f\"predictions have shape: {predictions.shape}\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "source": [
    "**Expected output:**\n",
    "\n",
    "```\n",
    "predictions have shape: (NUM_BATCHES, 3189)\n",
    "```\n",
    "\n",
    "Where `NUM_BATCHES` is the number of batches you have set to your dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false
   },
   "outputs": [],
   "source": [
    "# Test your code!\n",
    "unittests.test_create_model(create_model)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false
   },
   "outputs": [],
   "source": [
    "# Train the model\n",
    "history = model.fit(dataset, epochs=30, verbose=1)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "gy72RPgly55q"
   },
   "source": [
    "**To pass this assignment, your model should achieve a training accuracy of at least 80%**. If your model didn't achieve this threshold, try training again with a different model architecture. Consider increasing the number of units in your `LSTM` layer."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "1fXTEO3GJ282",
    "tags": []
   },
   "outputs": [],
   "source": [
    "# Get training and validation accuracies\n",
    "acc = history.history['accuracy']\n",
    "loss = history.history['loss']\n",
    "\n",
    "# Get number of epochs\n",
    "epochs = range(len(acc))\n",
    "\n",
    "fig, ax = plt.subplots(1, 2, figsize=(10, 5))\n",
    "fig.suptitle('Training performance - Accuracy and Loss')\n",
    "\n",
    "for i, (data, label) in enumerate(zip([acc,loss], [\"Accuracy\", \"Loss\"])):\n",
    "    ax[i].plot(epochs, data, label=label)\n",
    "    ax[i].legend()\n",
    "    ax[i].set_xlabel('epochs')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "OjvED5A3qrn2"
   },
   "source": [
    "If the accuracy meets the requirement of being greater than 80%, then save the `history.pkl` file which contains the information of the training history of your model and will be used to compute your grade. You can do this by running the following code:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "9QRG73l6qE-c",
    "tags": []
   },
   "outputs": [],
   "source": [
    "with open('history.pkl', 'wb') as f:\n",
    "    pickle.dump(history.history, f)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "wdsMszk9zBs_"
   },
   "source": [
    "## See your model in action\n",
    "\n",
    "After all your work it is finally time to see your model generating text. \n",
    "\n",
    "Run the cell below to generate the next 100 words of a seed text.\n",
    "\n",
    "After submitting your assignment you are encouraged to try out training for different amounts of epochs and seeing how this affects the coherency of the generated text. Also try changing the seed text to see what you get!"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "deletable": false,
    "id": "6Vc6PHgxa6Hm",
    "tags": []
   },
   "outputs": [],
   "source": [
    "seed_text = \"Help me Obi Wan Kenobi, you're my only hope\"\n",
    "next_words = 100\n",
    "  \n",
    "for _ in range(next_words):\n",
    "    # Convert the text into sequences\n",
    "    token_list = vectorizer(seed_text)\n",
    "    # Pad the sequences\n",
    "    token_list = tf.keras.utils.pad_sequences([token_list], maxlen=max_sequence_len-1, padding='pre')\n",
    "    # Get the probabilities of predicting a word\n",
    "    predicted = model.predict([token_list], verbose=0)\n",
    "    # Choose the next word based on the maximum probability\n",
    "    predicted = np.argmax(predicted, axis=-1).item()\n",
    "    # Get the actual word from the word index\n",
    "    output_word = vectorizer.get_vocabulary()[predicted]\n",
    "    # Append to the current text\n",
    "    seed_text += \" \" + output_word\n",
    "\n",
    "print(seed_text)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "deletable": false,
    "editable": false,
    "id": "6r-X-HXtSc8N"
   },
   "source": [
    "**Congratulations on finishing this week's assignment!**\n",
    "\n",
    "You have successfully implemented a neural network capable of predicting the next word in a sequence of text!\n",
    "\n",
    "**We hope to see you in the next course of the specialization! Keep it up!**"
   ]
  }
 ],
 "metadata": {
  "accelerator": "GPU",
  "colab": {
   "provenance": []
  },
  "dlai_version": "1.2.0",
  "grader_version": "1",
  "jupytext": {
   "main_language": "python"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.7"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}
